15 NumPy: Working with Arrays
15.1 Introduction
NumPy is a fundamental library for numerical computing in Python, frequently used for scientific research, econometrics, and machine learning. By providing tools to work with large arrays of numeric data efficiently, it forms the computational backbone of the Python scientific ecosystem. These arrays—called NumPy arrays—allow you to store, manipulate, and compute on data more quickly and clearly than Python’s standard containers (lists, tuples, dictionaries).
In typical economics and data science workflows, we often handle large datasets (e.g., panel datasets of countries over many years, simulations with millions of draws, or large cross-sectional data). Operations on such data—like summations, regressions, matrix multiplications, or statistical aggregations—can be very costly if carried out using pure Python loops. NumPy arrays solve this problem by providing contiguous (uninterrupted) memory storage and vectorized operations that run at compiled speeds.
This chapter introduces the major building blocks of NumPy. We will start with a conceptual understanding of what a NumPy array is, how it differs from Python’s built-in structures, and why arrays are so important for efficient computing. We will then explore how to create arrays, reshape them, index and slice them, handle broadcasting and vectorized operations, and save/load data to permanent storage. Our aim is to give you enough understanding of NumPy’s conceptual underpinnings that you can confidently work with the library, setting a foundation for more advanced topics in econometrics and machine learning.
15.2 Importing NumPy
If you use Google Colab or similar platforms, NumPy is already installed. For a local installation, you can install NumPy with:
pip install numpy
By convention, NumPy is imported under the alias np:
import numpy as np
This alias is a nearly universal standard in the Python scientific computing world.
15.2.1 Submodules
- np.random: For random number generation (uniform, normal, etc.).
- np.linalg: For linear algebra operations (e.g., matrix inverses, eigenvalues).
The remainder of this chapter will assume you have run import numpy as np.
15.3 Motivation and Conceptual Overview
15.3.1 Why NumPy Arrays?
Before diving into the syntax, consider why economists or data scientists use NumPy arrays instead of pure Python lists:
Homogeneity and Speed
A NumPy array can only hold one type of data, typically numeric (e.g., all floats, all integers). Because of this restriction, the array is laid out in memory contiguously. This organization is crucial for efficient numerical computing: it allows vectorized operations at the machine-code level without looping in Python.
In contrast, Python lists can store mixed types and are more flexible, but that flexibility comes at a cost in terms of performance and memory overhead.
Vectorized Operations
NumPy allows you to perform mathematical operations over entire arrays with very concise syntax. For instance, if you wanted to add 10 to every element in a list of 1 million integers using pure Python, you would likely write a loop and do it one-by-one. In NumPy, this is a single, optimized operation (arr + 10) that is often orders of magnitude faster.
Ease of Manipulation
Economists commonly work with vectors (1D structures) and matrices (2D structures). NumPy extends easily to higher dimensions (3D, 4D, etc.), which is sometimes required for panel data or more involved data structures. Once you learn the fundamentals of NumPy slicing and indexing, reorganizing, filtering, or reshaping your data becomes straightforward.
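To make the speed claim concrete, the following sketch compares a pure Python loop with the vectorized form. The array size and variable names are illustrative assumptions, and timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

n = 1_000_000  # illustrative size
values_list = list(range(n))
values_arr = np.arange(n)

# Pure Python: visit each element in an interpreted loop.
loop_result = [v + 10 for v in values_list]

# NumPy: one vectorized operation over the whole array.
vec_result = values_arr + 10

# Both approaches produce the same numbers.
same = np.array_equal(np.array(loop_result), vec_result)
print(same)  # True

# Time each approach a few times; exact numbers vary by machine.
t_loop = timeit.timeit(lambda: [v + 10 for v in values_list], number=3)
t_vec = timeit.timeit(lambda: values_arr + 10, number=3)
print(f"loop: {t_loop:.3f}s, vectorized: {t_vec:.3f}s")
```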
15.3.2 Differences from Python Lists
- Memory Structure: NumPy arrays store data in a contiguous block. Python lists store references to objects scattered around memory.
- Shape: A Python list is one-dimensional by default. You can nest lists to mimic multi-dimensional data, but it can become unwieldy. NumPy arrays can be created in 1D, 2D, or higher dimensions, and the library provides utilities to reshape and manipulate these dimensions consistently.
- Fixed Size: A NumPy array’s size is determined upon creation. While you can still concatenate arrays, they are not as freely resizable as Python lists. This constraint helps maintain predictable memory usage and speed.
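These differences can be seen directly in a short session. This is a minimal sketch with illustrative variable names.

```python
import numpy as np

# A list can mix types; an array converts everything to one common dtype.
mixed_list = [1, 2.5, 3]
as_array = np.array(mixed_list)
print(as_array.dtype)  # float64: the integers were converted to floats

# Nested lists have no shape attribute; arrays report their shape directly.
nested = [[1, 2, 3], [4, 5, 6]]
grid = np.array(nested)
print(len(nested), len(nested[0]))  # 2 3
print(grid.shape)  # (2, 3)

# Lists grow in place; "growing" an array builds a new array instead.
mixed_list.append(4)
longer = np.concatenate([as_array, np.array([4.0])])
print(as_array.size, longer.size)  # 3 4
```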
15.4 Creating and Inspecting Arrays
15.4.1 Array Creation
You create an array by passing a Python list (or list of lists) to np.array(). Here are some common patterns:
import numpy as np
# 1D array (vector)
vector = np.array([10, 20, 30, 40])

# 2D array (matrix)
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

# 3D array (cube-like structure)
cube = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]]
])
Though this code may look similar to Python lists, the resulting objects, vector, matrix, and cube, are all NumPy arrays, complete with properties that we explore next.
15.4.2 Inspecting Array Attributes
NumPy arrays come with built-in attributes to help you quickly understand their layout and the data they hold:
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.shape) # (2, 3) - tuple of dimensions
print(arr.ndim) # 2 - number of dimensions
print(arr.size) # 6 - total number of elements
print(arr.dtype) # int64 - data type of elements
- shape: The dimensions of the array (rows × columns in 2D).
- ndim: How many dimensions the array has (2D, 3D, etc.).
- size: Total number of elements in the array.
- dtype: The type of each element (e.g., float64, int64).
15.4.2.1 Why Data Types (dtype) Matter
NumPy arrays are homogeneous, meaning all elements share the same type. In economics or finance, you often deal with floating-point data (e.g., real numbers), so your arrays might typically have float64 as their dtype. However, if you need to store only integers (for instance, integer-coded categories), you could use int64, which keeps the values exact. A narrower integer type such as int32 or int8 also saves memory, since int64 and float64 both use 8 bytes per element.
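The effect of a dtype on storage and arithmetic can be checked directly. The sketch below uses made-up values; the names codes and prices are illustrative only.

```python
import numpy as np

codes = np.array([0, 1, 2, 1, 0], dtype=np.int8)       # small integer codes
prices = np.array([1.5, 2.25, 3.0], dtype=np.float64)

# itemsize is bytes per element; nbytes is the total for the array's data.
print(codes.itemsize, codes.nbytes)    # 1 5
print(prices.itemsize, prices.nbytes)  # 8 24

# astype returns a converted copy; float -> int truncates toward zero.
truncated = prices.astype(np.int64)
print(truncated)  # [1 2 3]

# Mixing integer and float arrays in arithmetic produces a float result.
mixed = codes[:3] + prices
print(mixed.dtype)  # float64
```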
15.5 Special Array Constructors
NumPy provides convenient functions for constructing arrays without having to manually type out lists:
Zeros and Ones
Useful for initializing arrays of a given shape with all zeros or ones:
zeros = np.zeros((3, 4))  # 3 rows, 4 columns of zeros
ones = np.ones((2, 2))  # 2 rows, 2 columns of ones
Empty
Creates an uninitialized array. Its initial content is arbitrary, so it is usually used when you plan to fill the array later:
empty = np.empty((2, 3))
Identity Matrix
Commonly needed in linear algebra (e.g., an identity matrix in regressions):
I = np.eye(3)  # 3x3 identity
Ranges
NumPy’s version of Python’s range() is np.arange(), which generates a sequence of values:
range_array = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
These constructors help you allocate arrays for typical operations in economics and econometrics, such as setting up design matrices for regressions, placeholders for iterative algorithms, or identity matrices for transformations.
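As one possible illustration of these uses, the sketch below builds a small design matrix with an intercept column and fills a preallocated placeholder. The regressor values are hypothetical, and np.column_stack is one of several ways to join columns.

```python
import numpy as np

# Hypothetical regressor values, for illustration only.
x = np.arange(1, 6, dtype=float)  # 1.0, 2.0, ..., 5.0
n = x.size

# Design matrix: a column of ones (intercept) next to the regressor.
X = np.column_stack([np.ones(n), x])
print(X.shape)  # (5, 2)

# Placeholder that an iterative computation fills step by step.
running_total = np.zeros(n)
for i in range(n):
    running_total[i] = x[: i + 1].sum()
print(running_total)  # cumulative sums: 1, 3, 6, 10, 15
```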
15.6 Reshaping and Transposition
A crucial feature of NumPy arrays is their shape manipulation, which allows you to reorganize data easily.
arr = np.array([1, 2, 3, 4, 5, 6])
matrix = arr.reshape(2, 3)
Here, a one-dimensional array with 6 elements is reshaped into a 2×3 matrix. Internally, NumPy does not copy data if it’s not necessary; it simply changes how the data are viewed.
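The shared-memory behavior can be verified with a few lines. This is a sketch of the common case; whether a reshape returns a view depends on the array's memory layout.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])
matrix = arr.reshape(2, 3)

# The reshaped array shares memory with the original here.
shared = np.shares_memory(arr, matrix)
print(shared)  # True

# Writing through the reshaped view is visible in the original.
matrix[0, 0] = 100
print(arr[0])  # 100

# -1 asks NumPy to infer that dimension from the total size.
tall = arr.reshape(-1, 2)
print(tall.shape)  # (3, 2)

# The sizes must match: 6 elements cannot fill a 4x2 array.
reshape_failed = False
try:
    arr.reshape(4, 2)
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```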
15.6.0.1 Transposition
For a 2D array, you can transpose rows and columns:
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
transposed = matrix.T
Transposition is useful in linear algebra (e.g., to compute \(\mathbf{X}^\top \mathbf{X}\) in a regression).
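As a small worked instance of \(\mathbf{X}^\top \mathbf{X}\), the sketch below uses a made-up 3×2 matrix; the values are illustrative only.

```python
import numpy as np

X = np.array([
    [1.0, 2.0],
    [1.0, 3.0],
    [1.0, 5.0],
])  # 3 observations, 2 columns (illustrative values)

XtX = X.T @ X
print(X.T.shape)  # (2, 3)
print(XtX)
# [[ 3. 10.]
#  [10. 38.]]

# X'X is always square and symmetric.
print(np.array_equal(XtX, XtX.T))  # True
```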
15.6.0.2 Expand and Squeeze Dimensions
Sometimes you must expand or reduce dimensions to meet the needs of a function:
vector = np.array([1, 2, 3])  # shape (3,)
col_vec = np.expand_dims(vector, axis=1)  # shape (3, 1)
row_vec = np.expand_dims(vector, axis=0)  # shape (1, 3)
Adding or removing “axes” in your arrays is common when working with broadcasting or advanced data manipulation routines.
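For the reverse direction, np.squeeze removes axes of length 1. The sketch below also shows np.newaxis, a common shorthand for adding an axis through indexing.

```python
import numpy as np

vector = np.array([1, 2, 3])              # shape (3,)
col_vec = np.expand_dims(vector, axis=1)  # shape (3, 1)

# np.squeeze drops axes of length 1, undoing the expansion.
flat = np.squeeze(col_vec)
print(flat.shape)  # (3,)

# Indexing with np.newaxis adds an axis, like expand_dims.
row_vec = vector[np.newaxis, :]
print(row_vec.shape)  # (1, 3)

# squeeze can target one axis; that axis must have length 1.
back = np.squeeze(row_vec, axis=0)
print(back.shape)  # (3,)
```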
15.7 Indexing and Slicing
15.7.1 Basic Indexing
Indexing in NumPy is similar to indexing in Python lists, but extended to multiple dimensions. If arr is 2D, arr[i, j] refers to the element in the \(i\)-th row and \(j\)-th column. For example:
matrix = np.array([
    [10, 20, 30],
    [40, 50, 60]
])

print(matrix[0, 1]) # 20
print(matrix[1, 2]) # 60
Indexing is essential for retrieving or modifying specific elements of your data.
15.7.2 Slicing
Slicing allows you to select subregions (contiguous blocks) of the array without copying data unnecessarily. The syntax is [start:stop:step].
1D slicing:
arr = np.array([0, 1, 2, 3, 4, 5])
print(arr[1:4])  # [1, 2, 3]
print(arr[::2])  # [0, 2, 4]
2D slicing:
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
submatrix = matrix[0:2, 1:3]  # rows 0 & 1, columns 1 & 2
# result: [[2, 3],
#          [5, 6]]
Slicing is a powerful tool for extracting or altering data subsets, e.g., selecting the first 100 observations, focusing on specific columns in a dataset, or partitioning time-series data into separate intervals.
15.7.2.1 Views vs. Copies in Slices
Crucially, slices often return “views,” not copies. Altering a slice can change the original array. If you truly need an independent piece of data, you should explicitly copy:
slice_view = matrix[0:2, :]
slice_copy = matrix[0:2, :].copy()
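The difference between the two becomes visible once you write to them. A minimal sketch, using a small 3×3 matrix as an assumed starting point:

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

slice_view = matrix[0:2, :]
slice_copy = matrix[0:2, :].copy()

slice_copy[0, 0] = -1        # affects only the copy
after_copy_write = matrix[0, 0]
print(after_copy_write)      # 1

slice_view[0, 0] = 99        # writes through to the original
after_view_write = matrix[0, 0]
print(after_view_write)      # 99

print(np.shares_memory(matrix, slice_view))  # True
print(np.shares_memory(matrix, slice_copy))  # False
```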
15.7.3 Boolean Indexing
Boolean indexing allows you to filter elements based on some logical condition. For instance, if you have an array of returns and you want to extract only positive ones:
returns = np.array([-0.02, 0.01, 0.04, -0.01])
mask = (returns > 0)
pos_returns = returns[mask]  # array([0.01, 0.04])
This is analogous to “select where returns are positive,” an operation fundamental to data cleaning and outlier detection in empirical analysis.
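Masks can also be combined and used for assignment. The thresholds in this sketch are arbitrary and chosen only for illustration.

```python
import numpy as np

returns = np.array([-0.02, 0.01, 0.04, -0.01])

# Combine conditions with & (and) or | (or); each condition needs parentheses.
small_moves = returns[(returns > -0.015) & (returns < 0.02)]
print(small_moves)  # the elements 0.01 and -0.01

# A mask can also select the elements to overwrite.
floored = returns.copy()
floored[floored < 0] = 0.0
print(floored)  # negatives replaced by 0.0

# Counting matches: True counts as 1 when summed.
n_positive = (returns > 0).sum()
print(n_positive)  # 2
```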
15.7.4 Fancy Indexing
Fancy indexing allows you to pull elements by specifying an array of integer indices:
arr = np.array([10, 20, 30, 40, 50])
indices = np.array([0, 3, 4])
selected = arr[indices]  # array([10, 40, 50])
This approach is useful when you already have identified the positions of special elements you want to extract—like matching certain time points or flagged observations.
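Two properties of fancy indexing are worth checking in practice: the result is a copy rather than a view, and the same idea selects whole rows of a 2D array. The panel array below is a made-up example.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
indices = np.array([0, 3, 4])
selected = arr[indices]

# Unlike a slice, fancy indexing returns a copy.
selected[0] = -1
print(arr[0])  # 10

# Indices may repeat and appear in any order.
repeated = arr[[4, 4, 0]]
print(repeated)  # [50 50 10]

# On a 2D array, an index list on the first axis picks whole rows.
panel = np.arange(12).reshape(4, 3)
rows = panel[[0, 2]]
print(rows)
# [[0 1 2]
#  [6 7 8]]
```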
15.8 Broadcasting
Broadcasting is one of NumPy’s most powerful (and initially non-intuitive) features, allowing you to perform arithmetic on arrays of different shapes.
15.8.1 Scalar Broadcasting
If you add a scalar to an array, NumPy “broadcasts” the scalar to match the array’s shape. That is, it treats the scalar as if it were an array of the same shape and type:
arr = np.array([1, 2, 3])
print(arr + 5)  # [6, 7, 8]
Conceptually, 5 became [5, 5, 5] under the hood.
15.8.2 Array Broadcasting
Two arrays with different shapes can still be compatible if one dimension can be repeated (“broadcast”) to match the other. For instance:
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])  # shape (2, 3)

vector = np.array([10, 20, 30])  # shape (3,)

result = matrix + vector
# shape (2, 3), done row-wise:
# [[11, 22, 33],
#  [14, 25, 36]]
In economic modeling or simulation contexts, broadcasting allows a concise expression of operations like adding a constant inflation term across all goods or all time periods, or applying a single coefficient vector to multiple data points in a matrix. The NumPy documentation describes broadcasting rules in more detail, but at a high level:
- NumPy compares the arrays dimension by dimension from right to left.
- They are considered compatible in a dimension if they are the same size, or if one has size 1 (which can be stretched).
- If the arrays differ in a dimension where neither is 1, broadcasting fails, resulting in an error.
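The three rules above can be traced on small arrays. The sketch below shows one compatible pair for each of the first two cases and one failure; the values are illustrative.

```python
import numpy as np

matrix = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])  # shape (2, 3)

# (2, 3) with (3,): trailing dimensions match, so the vector is reused per row.
col_means = matrix.mean(axis=0)  # shape (3,)
demeaned = matrix - col_means
print(demeaned)
# [[-1.5 -1.5 -1.5]
#  [ 1.5  1.5  1.5]]

# (2, 3) with (2, 1): the size-1 dimension is stretched across columns.
row_scale = np.array([[10.0], [100.0]])
scaled = matrix * row_scale
print(scaled.shape)  # (2, 3)

# (2, 3) with (2,): trailing sizes 3 and 2 differ and neither is 1.
broadcast_failed = False
try:
    matrix + np.array([1.0, 2.0])
except ValueError as err:
    broadcast_failed = True
    print("broadcast failed:", err)
```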
15.9 Elementwise and Aggregation Functions
15.9.1 Elementwise Functions
NumPy provides vectorized mathematical functions that act on whole arrays at once:
arr = np.array([0, np.pi/2, np.pi])

print(np.sin(arr)) # [0.0, 1.0, 1.2246e-16]
print(np.exp(arr))
print(np.sqrt(arr))
No loops are needed, and computations are efficient. This design underpins many advanced data analysis libraries (like pandas, statsmodels, and more) that build upon NumPy.
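A typical use in applied work is computing log returns from a price series. The prices below are hypothetical, and the two formulations are shown only to confirm that they agree.

```python
import numpy as np

# Hypothetical price series, for illustration only.
prices = np.array([100.0, 110.0, 121.0])

# Growth factors between consecutive periods, computed without a loop.
growth = prices[1:] / prices[:-1]
log_returns = np.log(growth)

# Equivalent formulation: difference of log prices.
log_returns_alt = np.diff(np.log(prices))

print(np.allclose(log_returns, log_returns_alt))  # True
print(np.allclose(np.exp(log_returns), growth))   # True
```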
15.9.2 Aggregation Functions
Aggregation combines elements of an array into a single value or a set of values:
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(np.sum(matrix))  # 21
print(np.mean(matrix)) # 3.5
print(np.max(matrix)) # 6
Additionally, you can specify an “axis” along which to aggregate:
- axis=0: Aggregate down the columns (one result per column).
- axis=1: Aggregate across the rows (one result per row).
print(np.sum(matrix, axis=0)) # [5, 7, 9]
print(np.sum(matrix, axis=1)) # [6, 15]
In economics, summing across rows might correspond to summing variables across regions; summing across columns might correspond to summing observations across time.
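Axis-wise results often feed back into arithmetic on the original array. The sketch below, on the same small matrix, uses the keepdims argument so that row totals keep a shape that lines up with the rows.

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

col_means = matrix.mean(axis=0)  # one value per column
row_means = matrix.mean(axis=1)  # one value per row
print(col_means)  # [2.5 3.5 4.5]
print(row_means)  # [2. 5.]

# keepdims=True keeps the collapsed axis with length 1, so the result
# can be combined with the original array directly.
row_totals = matrix.sum(axis=1, keepdims=True)  # shape (2, 1)
shares = matrix / row_totals
print(shares.sum(axis=1))  # each row of shares sums to 1
```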
15.10 Linear Algebra
Many standard linear algebra routines are available in NumPy. Matrix multiplication uses np.dot or the @ operator, while np.linalg provides inversion, linear-system solvers, and eigenvalue decomposition:
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication
c = np.dot(a, b)
# or equivalently
c_alt = a @ b

# Inverse
inv_a = np.linalg.inv(a)

# Solve Ax = b
b_vec = np.array([1, 2])
x = np.linalg.solve(a, b_vec)
These are cornerstones of econometric operations (e.g., OLS regressions often involve matrix multiplication and inversion) and provide a direct path to handle fundamental tasks in quantitative modeling.
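To connect these routines to OLS, the sketch below solves the normal equations \((\mathbf{X}^\top \mathbf{X})\beta = \mathbf{X}^\top \mathbf{y}\) on made-up data generated exactly by y = 1 + 2x, so the expected coefficients are known in advance. It is an illustration, not a full regression routine.

```python
import numpy as np

# Hypothetical data generated exactly by y = 1 + 2*x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X'X) beta = X'y
XtX = X.T @ X
Xty = X.T @ y

beta_solve = np.linalg.solve(XtX, Xty)  # solves the system directly
beta_inv = np.linalg.inv(XtX) @ Xty     # textbook formula with an explicit inverse

print(np.allclose(beta_solve, [1.0, 2.0]))  # True
print(np.allclose(beta_solve, beta_inv))    # True
```

Both routes agree here. Solving the system directly avoids forming the inverse, which is generally the numerically safer habit.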
15.11 Comparisons and Logical Operations
NumPy supports elementwise and arraywise comparisons:
arr1 = np.array([1, 2, 3])
arr2 = np.array([2, 2, 1])

print(arr1 == arr2) # [False, True, False]
print(np.array_equal(arr1, arr2)) # False
Such operations make it straightforward to identify matching records, detect missing or invalid data, and implement condition-based logic (e.g., “select all rows where GDP > 1000”).
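The following sketch applies these ideas to a hypothetical GDP series with one missing value, and adds a tolerance-based comparison for floating-point results. The numbers are invented for illustration.

```python
import numpy as np

gdp = np.array([850.0, 1200.0, np.nan, 3100.0])  # hypothetical values

# NaN marks missing data; it never compares equal to anything, so use np.isnan.
missing = np.isnan(gdp)
print(missing.sum())  # 1 missing entry

# Drop the missing entry, then apply the condition.
valid = gdp[~missing]
rich = valid[valid > 1000]
print(rich)  # [1200. 3100.]

# np.where chooses between two values element by element.
filled = np.where(missing, 0.0, gdp)

# Floating-point results should be compared with a tolerance.
print(0.1 + 0.2 == 0.3)            # False
print(np.isclose(0.1 + 0.2, 0.3))  # True
```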
15.12 Random Numbers
Random numbers are indispensable for Monte Carlo experiments, bootstrapping, or stochastic simulations in economic modeling. NumPy offers a dedicated submodule, np.random, with many useful functions:
np.random.seed(42)  # for reproducibility

# Uniform distribution
sample_uniform = np.random.rand(3, 3)

# Normal distribution
sample_normal = np.random.normal(0, 1, 1000)

# Integer random numbers
sample_ints = np.random.randint(0, 10, 5)
By specifying a seed, you ensure reproducible results, which is vital for research transparency and consistent results across runs.
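NumPy also offers a generator-based interface, np.random.default_rng, which its documentation recommends for new code. The sketch below uses it to show reproducibility and a tiny Monte Carlo check. The seeds and sample sizes are arbitrary choices.

```python
import numpy as np

# Two generators built from the same seed produce identical draws.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
draws_a = rng_a.normal(loc=0.0, scale=1.0, size=1000)
draws_b = rng_b.normal(loc=0.0, scale=1.0, size=1000)
print(np.array_equal(draws_a, draws_b))  # True

# A small Monte Carlo experiment: the sample mean of uniform(0, 1) draws
# should be close to 0.5 when the number of draws is large.
rng = np.random.default_rng(0)
uniform_draws = rng.uniform(0.0, 1.0, size=100_000)
print(uniform_draws.mean())

# Integers in [0, 10): the upper bound is excluded by default.
ints = rng.integers(0, 10, size=5)
print(ints)
```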
15.13 Sorting and Searching
Sorting is important when you need data ordered by time, magnitude, or any other metric:
arr = np.array([3, 1, 4, 1, 5, 9])
sorted_arr = np.sort(arr)  # returns a sorted copy
You can also sort along axes in 2D arrays:
matrix = np.array([[3, 1, 4], [1, 5, 9]])
print(np.sort(matrix, axis=0))  # Sort each column
print(np.sort(matrix, axis=1)) # Sort each row
Searching utilities, such as np.argsort (to get the indices that would sort an array) or np.searchsorted (binary search in a sorted array), are especially relevant for time-series alignment or indexing events chronologically.
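A brief sketch of both utilities follows. The labels and event times are made up, and kind="stable" is chosen so that ties keep their original order.

```python
import numpy as np

arr = np.array([3, 1, 4, 1, 5, 9])

# argsort returns the positions that would sort the array.
order = np.argsort(arr, kind="stable")
print(order)       # [1 3 0 2 4 5]
print(arr[order])  # [1 1 3 4 5 9]

# The same ordering can reorder a second, aligned array.
labels = np.array(["c", "a", "d", "b", "e", "f"])
print(labels[order])  # ['a' 'b' 'c' 'd' 'e' 'f']

# searchsorted finds where a value would be inserted into a sorted array.
event_times = np.array([10, 20, 30, 40])  # already sorted
pos_25 = np.searchsorted(event_times, 25)
pos_30 = np.searchsorted(event_times, 30)  # default side='left'
print(pos_25, pos_30)  # 2 2
```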
15.14 Saving and Loading Data
Because economic and machine learning analyses often involve large datasets or repeated computations, saving and loading arrays in an efficient format is critical:
data_2d = np.array([[1, 2, 3],
                    [4, 5, 6]])
np.savetxt('array_2d.csv', data_2d, delimiter=',')

# Reload from CSV
loaded_data_2d = np.loadtxt('array_2d.csv', delimiter=',')
For higher-dimensional data, CSV may not be straightforward (you need to reshape). NumPy’s binary format (.npy) handles any dimension without hassle:
# Save to .npy
data_3d = np.zeros((2, 3, 4))  # example 3D array
np.save('array_3d.npy', data_3d)

# Load
restored = np.load('array_3d.npy')
Using .npy or .npz (zipped) files is faster for large data, preserves data types, and avoids CSV’s limitations.
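The sketch below round-trips a 3D array through .npy and stores two named arrays in a single .npz file. The file names are arbitrary, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import numpy as np

cube = np.arange(24).reshape(2, 3, 4)  # 3D integer array
weights = np.array([0.2, 0.3, 0.5])

# Write into a temporary directory so no files are left behind.
with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "cube.npy")
    np.save(npy_path, cube)
    cube_back = np.load(npy_path)

    # .npz stores several named arrays in one file.
    npz_path = os.path.join(tmp, "bundle.npz")
    np.savez(npz_path, cube=cube, weights=weights)
    with np.load(npz_path) as bundle:
        weights_back = bundle["weights"]

print(cube_back.shape, cube_back.dtype == cube.dtype)  # (2, 3, 4) True
print(np.array_equal(weights_back, weights))           # True
```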
15.15 Concluding Remarks
NumPy arrays are at the core of fast, efficient data management in Python. For economists who are accustomed to thinking in terms of vectors and matrices, NumPy provides a natural programming interface that extends seamlessly to higher dimensions when needed. Key points to remember:
Concept of Contiguous Data
NumPy arrays store data in contiguous memory blocks, enabling efficient computation and vectorized operations.
Shape Manipulation and Indexing
You can reshape arrays, slice them, and index them in a variety of ways to work with exactly the portion of your data you need.
Broadcasting
This feature facilitates concise and powerful arithmetic on arrays of differing shapes, which is particularly handy in model building and simulation.
Linear Algebra Integration
NumPy provides well-optimized routines for matrix operations, eigenvalues, and other standard linear algebra procedures, ubiquitous in econometrics.
Data Persistence
Arrays can be saved and loaded quickly, making repeated analyses or large simulations manageable.
With a firm grasp of these fundamentals, you will find it easier to adopt more advanced data tools, such as pandas for data frames, statsmodels for econometric analysis, or scikit-learn for machine learning. As you proceed, keep these principles in mind: NumPy’s efficiency is often the foundation upon which entire analytic pipelines are built. By leveraging arrays effectively, you will set yourself up for success in handling large datasets, running complex models, and ultimately deriving more insights from your data.